Abstract

We derive explicit Bayesian nonparametric analysis for a species sampling model with
finitely many types of Gibbs form of type $\alpha = -1$ recently introduced in Gnedin (2009).
Our results complement existing analysis under Gibbs priors of type $\alpha \in [0,1)$ proposed
in Lijoi et al. (2008). Calculations rely on a group sequential construction of Gibbs
partitions introduced in Cerquetti (2008).

1 Introduction 

In the species sampling problem a random sample is drawn from a hypothetically infinite
population of individuals to make inference on the unknown total number of different species.
A Bayesian nonparametric approach to this problem has been recently proposed in Lijoi et
al. (2007, 2008) to derive posterior predictive inference on species richness for an additional
sample under the assumption that the vector of multiplicities of different species observed is
a random sample from an exchangeable Gibbs partition of type $\alpha \in [0, 1)$.

Exchangeable Gibbs partitions of type $\alpha \in [-\infty, 1)$ (Gnedin and Pitman, 2006) are models
for random partitions of the positive integers which extend the Ewens-Pitman two-parameter
family and are characterized by an exchangeable partition probability function (EPPF) with the
following structure

$$p(n_1, \dots, n_k) = V_{n,k} \prod_{i=1}^{k} (1 - \alpha)_{n_i - 1 \uparrow}, \qquad (1)$$

with weights $(V_{n,k})$ being a solution to the backward recursion $V_{n,k} = (n - \alpha k) V_{n+1,k} + V_{n+1,k+1}$
with $V_{1,1} = 1$. Gnedin and Pitman also provide a constructive result for this class (cfr. Th.
12) showing that each element arises as a unique probability mixture of extreme partitions
which are

a) $PD(\alpha, \xi|\alpha|)$ partitions with $\xi = 1, \dots, \infty$ for $\alpha \in [-\infty, 0)$,

b) $PD(0, \theta)$ partitions with $\theta \in [0, \infty)$ for $\alpha = 0$,

c) $PK(\rho_\alpha | t)$ partitions with $t \in [0, \infty)$ for $\alpha \in (0, 1)$.



Here $PD(\cdot, \cdot)$ stands for the two-parameter Poisson-Dirichlet distribution (Pitman and Yor,
1997) and $PK(\rho_\alpha | t)$ for the conditional Poisson-Kingman distribution derived from the stable
subordinator (cfr. Pitman, 2003).

Lijoi et al. (2008) establish general distributional results and properties for EPPFs belonging
to subclass c) to make conditional predictions according to a Bayesian nonparametric
procedure. Focusing on class c), their analysis relies on the hypothesis that the total unknown
number of species is so large that it can be assumed infinite, an assumption that may be
unrealistic in concrete applications. Here we focus on a class of random partitions with a finite
but random number of different species, recently introduced in Gnedin (2009), which belongs to
subclass a), and contribute, in view of possible future Bayesian implementations, by deriving
posterior predictive results analogous to those in Lijoi et al. (2008).

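First recall that for $\alpha < 0$, $\theta = |\alpha|\xi$, and $\xi = 1, 2, 3, \dots$, the $PD(\alpha, \xi|\alpha|)$ model has EPPF (see
e.g. Pitman, 1996, 2006)

$$p(n_1, \dots, n_k) = \frac{|\alpha|^{k-1} (\xi - 1)_{k-1 \downarrow}}{(\xi|\alpha| + 1)_{n-1 \uparrow}} \prod_{j=1}^{k} (1 + |\alpha|)_{n_j - 1 \uparrow}, \qquad (2)$$

or equivalently

$$p(n_1, \dots, n_k) = \frac{(\xi)_{k \downarrow}}{(\xi|\alpha|)_{n \uparrow}} \prod_{j=1}^{k} (|\alpha|)_{n_j \uparrow},$$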



and arises by the following sequential procedure. Given the partition of $[n]$ in $K_n = k$ blocks
with occupancy counts $\mathbf{n} = (n_1, \dots, n_k)$, the partition of $[n + 1]$ is obtained through the one-step
prediction rules

$$p_j(\mathbf{n}) = \frac{n_j + |\alpha|}{n + |\alpha|\xi} \quad \text{and} \quad p_0(\mathbf{n}) = \frac{|\alpha|(\xi - k)}{\xi|\alpha| + n},$$

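where $p_j(\mathbf{n})$, for $j = 1, \dots, k$, stands for the probability of placing $n + 1$ in an old
block $j$, while $p_0(\mathbf{n})$ stands for the probability that $n + 1$ forms a new block $k + 1$. For $\alpha = -1$
the EPPF in (2) reduces to

$$p(n_1, \dots, n_k) = \frac{(\xi - 1)_{k-1 \downarrow}}{(\xi + 1)_{n-1 \uparrow}} \prod_{j=1}^{k} n_j!$$

or alternatively

$$p(n_1, \dots, n_k) = \frac{(\xi)_{k \downarrow}}{(\xi)_{n \uparrow}} \prod_{j=1}^{k} n_j!,$$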
with one-step prediction rules

$$p_j(\mathbf{n}) = \frac{n_j + 1}{n + \xi} \quad \text{and} \quad p_0(\mathbf{n}) = \frac{\xi - k}{n + \xi}.$$

The corresponding limit frequencies $(P_{\xi,1}, \dots, P_{\xi,\xi})$ have a stick-breaking representation

$$P_{\xi,j} = W_j \prod_{i=1}^{j-1} (1 - W_i), \quad \text{with independent } W_i \sim \mathrm{Beta}(2, \xi - i)$$

for $i = 1, \dots, \xi$ and $\mathrm{Beta}(2, 0)$ a Dirac mass at 1.
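The one-step prediction rules of the $PD(\alpha, \xi|\alpha|)$ model can be sketched as a small sequential sampler. This is only an illustrative sketch: the function name `sample_pd_partition`, its arguments and the seeding are assumptions made here, not part of the original paper. An old block $j$ is chosen with weight $n_j + |\alpha|$ and a new block with weight $|\alpha|(\xi - k)$, so no more than $\xi$ blocks can ever appear.

```python
import random


def sample_pd_partition(n, xi, alpha=-1.0, seed=0):
    """Sequentially allocate items 1..n with the PD(alpha, xi*|alpha|) rules (alpha < 0).

    Returns the list of block sizes (occupancy counts) after n items.
    Hypothetical helper written for illustration only.
    """
    rng = random.Random(seed)
    a = abs(alpha)
    counts = [1]  # item 1 opens the first block
    for _ in range(1, n):
        k = len(counts)
        # Old block j has weight n_j + |alpha|.
        weights = [c + a for c in counts]
        # A new block has weight |alpha| * (xi - k); it is impossible once k == xi.
        if k < xi:
            weights.append(a * (xi - k))
        j = rng.choices(range(len(weights)), weights=weights)[0]
        if j == k:
            counts.append(1)
        else:
            counts[j] += 1
    return counts
```

For instance, `sample_pd_partition(200, 5)` returns at most five block sizes summing to 200, reflecting the finite number of types.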

2 Gnedin's model with finitely many types 

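By the Gnedin and Pitman result, $PD(\alpha, \xi|\alpha|)$ models are extreme points of a convex set of
Gibbs partitions of type $\alpha < 0$, whose elements are in one-to-one correspondence with a set of
mixing distributions over the set of the positive integers. Gnedin (2009) studies the particular
model arising by mixing a $PD(\alpha, |\alpha|\xi)$ model for $\alpha = -1$ over $\xi$ with

$$\Pr(\Xi = \xi) = \frac{\gamma (1 - \gamma)_{\xi - 1 \uparrow}}{\xi!}$$

for $\xi = 1, 2, \dots$ and $\gamma \in (0, 1)$ and shows it has weights

$$V_{n,k} = \frac{(k - 1)!}{(n - 1)!} \frac{(1 - \gamma)_{k-1 \uparrow} (\gamma)_{n-k \uparrow}}{(1 + \gamma)_{n-1 \uparrow}} = \frac{(k - 1)!}{(n - 1)!} \frac{\gamma (1 - \gamma)_{k-1 \uparrow}}{(\gamma + n - k)_{k \uparrow}}, \qquad (3)$$

hence EPPF

$$p(n_1, \dots, n_k) = \frac{(k - 1)!}{(n - 1)!} \frac{(1 - \gamma)_{k-1 \uparrow} (\gamma)_{n-k \uparrow}}{(1 + \gamma)_{n-1 \uparrow}} \prod_{j=1}^{k} n_j!, \qquad (4)$$

obtained by sequential construction with one-step prediction rules

$$p_j(\mathbf{n}) = \frac{(n - k + \gamma)(n_j + 1)}{n(n + \gamma)} \ \text{ for } j = 1, \dots, k \quad \text{and} \quad p_0(\mathbf{n}) = \frac{k(k - \gamma)}{n(n + \gamma)}.$$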

Gnedin also derives an analogue of the Ewens sampling formula and further results on the
vector of frequencies and on the exchangeable sequences induced by sampling from this model.
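The weights of Gnedin's model can be checked against the Gibbs backward recursion for $\alpha = -1$, that is $V_{n,k} = (n + k) V_{n+1,k} + V_{n+1,k+1}$, using exact rational arithmetic. The sketch below is only illustrative: the function names are assumptions introduced here, and $\gamma$ is taken rational so that the comparison is exact.

```python
from fractions import Fraction
from math import factorial


def rising(x, n):
    """Rising factorial (x)_n = x (x+1) ... (x+n-1), with (x)_0 = 1."""
    out = Fraction(1)
    for i in range(n):
        out *= x + i
    return out


def gnedin_weight(n, k, gamma):
    """V_{n,k} = (k-1)!/(n-1)! * (1-gamma)_{k-1} (gamma)_{n-k} / (1+gamma)_{n-1}."""
    return (Fraction(factorial(k - 1), factorial(n - 1))
            * rising(1 - gamma, k - 1) * rising(gamma, n - k)
            / rising(1 + gamma, n - 1))


def recursion_holds(n, k, gamma):
    """Check V_{n,k} = (n + k) V_{n+1,k} + V_{n+1,k+1} (Gibbs recursion, alpha = -1)."""
    lhs = gnedin_weight(n, k, gamma)
    rhs = ((n + k) * gnedin_weight(n + 1, k, gamma)
           + gnedin_weight(n + 1, k + 1, gamma))
    return lhs == rhs


def new_block_probability(n, k, gamma):
    """One-step probability of a new block: k (k - gamma) / (n (n + gamma))."""
    return Fraction(k) * (k - gamma) / (n * (n + gamma))
```

The ratio `gnedin_weight(n + 1, k + 1, gamma) / gnedin_weight(n, k, gamma)` coincides with `new_block_probability(n, k, gamma)`, which is how the one-step rule follows from the weights.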



For what follows it is worth noticing that Gnedin's one-step prediction rules may be
equivalently expressed as $m$-step prediction rules as in Cerquetti (2008, Prop. 3), which we
formulate here as a group sequential random allocation of balls labelled $1, 2, \dots$ in a series of
boxes. First, from (3) we obtain

$$V_{n+m,k+k^*} = \frac{(k + k^* - 1)!}{(n + m - 1)!} \frac{(1 - \gamma)_{k+k^*-1 \uparrow} (\gamma)_{n+m-k-k^* \uparrow}}{(1 + \gamma)_{n+m-1 \uparrow}},$$
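then, by basic properties of rising factorials, the specialization of the general formulas for Gibbs
partitions easily follows. Start with box $B_{1,1}$ with a single ball. Given the placement of the first
group of $n$ balls in a $(n_1, \dots, n_k)$ configuration in $k$ boxes, the new group of $m$ balls labelled
$\{n+1, \dots, n+m\}$ is:

a) allocated in the old $k$ boxes in configuration $(m_1, \dots, m_k)$, for $m_j \geq 0$, $\sum_{j=1}^{k} m_j = m$,
with probability

$$p_{\mathbf{m}}(\mathbf{n}) = \frac{(\gamma + n - k)_{m \uparrow}}{(n)_{m \uparrow} (\gamma + n)_{m \uparrow}} \prod_{j=1}^{k} (n_j + 1)_{m_j \uparrow}, \qquad (5)$$

b) allocated in $k^*$ new boxes in configuration $(s_1, \dots, s_{k^*})$, for $\sum_{j=1}^{k^*} s_j = m$, $1 \leq k^* \leq m$,
$s_j \geq 1$, with probability

$$p_{\mathbf{s}}(\mathbf{n}) = \frac{(k)_{k^* \uparrow} (k - \gamma)_{k^* \uparrow} (\gamma + n - k)_{m-k^* \uparrow}}{(n)_{m \uparrow} (\gamma + n)_{m \uparrow}} \prod_{j=1}^{k^*} s_j!, \qquad (6)$$

c) $s < m$ balls are allocated in $k^*$ new boxes in configuration $(s_1, \dots, s_{k^*})$ and the remaining
$m - s$ balls in the old boxes in configuration $(m_1, \dots, m_k)$, for $\sum_{j=1}^{k} m_j = m - s$, $1 \leq s < m$,
$\sum_{j=1}^{k^*} s_j = s$, $m_j \geq 0$, $s_j \geq 1$, with probability

$$p_{\mathbf{m},\mathbf{s}}(\mathbf{n}) = \frac{(k)_{k^* \uparrow} (k - \gamma)_{k^* \uparrow} (\gamma + n - k)_{m-k^* \uparrow}}{(n)_{m \uparrow} (\gamma + n)_{m \uparrow}} \prod_{j=1}^{k} (n_j + 1)_{m_j \uparrow} \prod_{j=1}^{k^*} s_j!. \qquad (7)$$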

These $m$-step prediction rules allow one to readily obtain Bayesian posterior predictive
distributional results for the random partition induced by an additional $m$-sample from Gnedin's
model. By exploiting the definition of central and non-central Lah numbers as a particular case
of central and non-central generalized Stirling numbers of the first kind, the results of the next
section are obtained by specializing results in Lijoi et al. (2008) by means of expressions
derived in Cerquetti (2008). See the Appendix for the relationship between generalized Stirling
numbers and generalized factorial coefficients, both central and non-central. For an explicit
example of application of this kind of results see e.g. Section 4 in Lijoi et al. (2008).



3 Posterior predictive analysis of Gnedin's model 

First notice that generalized Stirling numbers of the first kind $S_{n,k}^{-1,-\alpha}$ for $\alpha = -1$ admit an
explicit expression known as Lah numbers

$$S_{n,k}^{-1,1} = \binom{n-1}{k-1} \frac{n!}{k!},$$

which are connection coefficients defined by

$$(x)_{n \uparrow} = \sum_{k=0}^{n} S_{n,k}^{-1,1} (x)_{k \downarrow},$$

hence, by an application of (10) in Gnedin and Pitman (2006) (see Gnedin, 2009, eq. (2)), the
distribution of the number $K_n$ of occupied boxes for Gnedin's model is readily calculated as

$$\Pr(K_n = k) = V_{n,k} S_{n,k}^{-1,1} = \binom{n}{k} \frac{(1 - \gamma)_{k-1 \uparrow} (\gamma)_{n-k \uparrow}}{(1 + \gamma)_{n-1 \uparrow}}.$$
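As a quick sanity check, the distribution of $K_n$, namely $\binom{n}{k} (1-\gamma)_{k-1\uparrow} (\gamma)_{n-k\uparrow} / (1+\gamma)_{n-1\uparrow}$, should sum to one over $k = 1, \dots, n$. The following sketch verifies this exactly for a rational $\gamma$; the function names are illustrative assumptions and do not come from the paper.

```python
from fractions import Fraction
from math import comb


def rising(x, n):
    """Rising factorial (x)_n = x (x+1) ... (x+n-1), with (x)_0 = 1."""
    out = Fraction(1)
    for i in range(n):
        out *= x + i
    return out


def pr_occupied_boxes(n, k, gamma):
    """Pr(K_n = k) = C(n,k) (1-gamma)_{k-1} (gamma)_{n-k} / (1+gamma)_{n-1}."""
    return (comb(n, k) * rising(1 - gamma, k - 1) * rising(gamma, n - k)
            / rising(1 + gamma, n - 1))


def total_mass(n, gamma):
    """Sum of Pr(K_n = k) over k = 1..n; should be exactly 1."""
    return sum(pr_occupied_boxes(n, k, gamma) for k in range(1, n + 1))
```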



Then recall the definition of non-central Lah numbers with parameter of non-centrality $r$
(see e.g. Charalambides, 2005), which correspond to non-central generalized Stirling numbers
for $\alpha = -1$ (with non-centrality $-r$ in the notation of the Appendix)

$$S_{n,k}^{-1,1,-r} = \frac{n!}{k!} \binom{n - r - 1}{n - k}.$$

Now a Bayesian nonparametric posterior predictive analysis of the class (4) is readily
derived. Notice that for the sake of generality we maintain here the treatment in terms of random
allocation of balls in boxes. The obvious translation in terms of random partition of individuals
among species follows easily.

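Given the sufficiency of the number $K_n$ of boxes induced by the basic sample, the joint
distribution of the number $S$ of balls allocated in $k^*$ new boxes in a specific configuration
$(s_1, \dots, s_{k^*})$ given $(n_1, \dots, n_k)$ is obtained by marginalizing (7) with respect to $(m_1, \dots, m_k)$ and
by an application of the multinomial theorem for rising factorials (see the Appendix) as in
Lijoi et al. (2008) (cfr. eq. (27) in Cerquetti, 2008),

$$\Pr(S = s, s_1, \dots, s_{k^*} \mid n_1, \dots, n_k) = \frac{(k)_{k^* \uparrow} (k - \gamma)_{k^* \uparrow} (\gamma + n - k)_{m-k^* \uparrow}}{(n)_{m \uparrow} (\gamma + n)_{m \uparrow}} \binom{m}{s} (n + k)_{m-s \uparrow} \prod_{j=1}^{k^*} s_j!. \qquad (8)$$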

The joint distribution of the number of new boxes $K_m$ and of the total number of balls
falling in new boxes $S$ given $K_n$ (cfr. eq. (28) in Cerquetti, 2008) is given by

$$\Pr(K_m = k^*, S = s \mid K_n = k) = \frac{(k)_{k^* \uparrow} (k - \gamma)_{k^* \uparrow} (\gamma + n - k)_{m-k^* \uparrow}}{(n)_{m \uparrow} (\gamma + n)_{m \uparrow}} \binom{m}{s} (n + k)_{m-s \uparrow} \binom{s-1}{k^*-1} \frac{s!}{k^*!} \qquad (9)$$

and arises from (8) by summing over the space of all partitions of $s$ elements in $k^*$ blocks and
exploiting the definition of Lah numbers.

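Marginalizing (9) with respect to $K_m$, a probability distribution for the total number $S$
of balls in new boxes is obtained in terms of Lah numbers according to eq. (29) in Cerquetti
(2008) and eq. (11) in Lijoi et al. (2008)

$$\Pr(S = s \mid n_1, \dots, n_k) = \binom{m}{s} \frac{(n + k)_{m-s \uparrow}}{(n)_{m \uparrow} (\gamma + n)_{m \uparrow}} \sum_{k^*=0}^{s} S_{s,k^*}^{-1,1} \, (k)_{k^* \uparrow} (k - \gamma)_{k^* \uparrow} (\gamma + n - k)_{m-k^* \uparrow}. \qquad (10)$$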

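The probability distribution of the number $K_m$ of new blocks induced by the additional
sample given the basic sample follows by marginalizing (9) with respect to $S$ and exploiting the
definition of non-central Lah numbers with parameter of non-centrality $r = -(n + k)$. An
application of eq. (4) in Lijoi et al. (2007) yields

$$\Pr(K_m = k^* \mid K_n = k) = \frac{(k)_{k^* \uparrow} (k - \gamma)_{k^* \uparrow} (\gamma + n - k)_{m-k^* \uparrow}}{(n)_{m \uparrow} (\gamma + n)_{m \uparrow}} \frac{m!}{k^*!} \binom{m + n + k - 1}{m - k^*}. \qquad (11)$$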

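The expected value of the number of new boxes in the $m$-sample conditional on the basic
sample, which provides the Bayes estimator for $K_m$ under quadratic loss function, is

$$E(K_m \mid K_n = k) = \sum_{k^*=0}^{m} k^* \binom{m}{k^*} \frac{(k)_{k^* \uparrow} (k - \gamma)_{k^* \uparrow} (\gamma + n - k)_{m-k^* \uparrow}}{(n)_{m \uparrow} (\gamma + n)_{m \uparrow}} (n + k + k^*)_{m-k^* \uparrow}. \qquad (12)$$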
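The posterior distribution of the number of new boxes and its mean can be evaluated exactly for small samples. The sketch below is a hedged illustration, not the paper's code: it assumes the form $\Pr(K_m = k^* \mid K_n = k) = \binom{m}{k^*} (k)_{k^*\uparrow} (k-\gamma)_{k^*\uparrow} (\gamma+n-k)_{m-k^*\uparrow} (n+k+k^*)_{m-k^*\uparrow} / [(n)_{m\uparrow} (\gamma+n)_{m\uparrow}]$, where the non-central Lah number with $r = -(n+k)$ has been rewritten as $\binom{m}{k^*} (n+k+k^*)_{m-k^*\uparrow}$. Function names and the rational $\gamma$ are assumptions.

```python
from fractions import Fraction
from math import comb


def rising(x, n):
    """Rising factorial (x)_n = x (x+1) ... (x+n-1), with (x)_0 = 1."""
    out = Fraction(1)
    for i in range(n):
        out *= x + i
    return out


def pr_new_boxes(k_star, m, n, k, gamma):
    """Pr(K_m = k_star | K_n = k) for Gnedin's model (assumed closed form, see lead-in)."""
    weight_ratio = (rising(Fraction(k), k_star) * rising(k - gamma, k_star)
                    * rising(gamma + n - k, m - k_star)
                    / (rising(Fraction(n), m) * rising(gamma + n, m)))
    # Non-central Lah number with r = -(n + k), written as C(m, k*) (n+k+k*)_{m-k*}.
    lah_noncentral = comb(m, k_star) * rising(Fraction(n + k + k_star), m - k_star)
    return weight_ratio * lah_noncentral


def expected_new_boxes(m, n, k, gamma):
    """Bayes estimator of K_m under quadratic loss: the posterior mean."""
    return sum(j * pr_new_boxes(j, m, n, k, gamma) for j in range(m + 1))
```

With $m = 1$ the mean reduces to the one-step probability of a new box, $k(k-\gamma)/(n(n+\gamma))$, which gives a simple consistency check.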

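Notice that for $m \to \infty$, (11) agrees with the posterior result for the total number of boxes
obtained in Gnedin (2009). In fact, by the standard asymptotics $\Gamma(n + a)/\Gamma(n + b) \sim n^{a-b}$ and
recalling the definition of rising and falling factorials in terms of the Gamma function

$$(a)_{b \uparrow} = \frac{\Gamma(a + b)}{\Gamma(a)} \quad \text{and} \quad (a)_{b \downarrow} = \frac{\Gamma(a + 1)}{\Gamma(a - b + 1)},$$

equation (11) reduces to

$$\lim_{m \to \infty} \Pr(K_m = k^* \mid K_n = k) = \frac{(n - 1)! \, (k)_{k^* \uparrow} (k - \gamma)_{k^* \uparrow} (\gamma + n - k)_{k \uparrow}}{k^*! \, \Gamma(n + k + k^*)}. \qquad (13)$$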
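The posterior distribution obtained in Gnedin (2009), in terms of $\varkappa = k + k^*$, is

$$\Pr(\Xi = \varkappa \mid K_n = k) = \frac{(n - 1)!}{(k - 1)! \, (\varkappa + n - 1)!} \prod_{i=1}^{k-1} (\varkappa - i) \prod_{j=1}^{k} (\gamma + n - j) \prod_{l=k}^{\varkappa - 1} (l - \gamma)$$

for $1 \leq k \leq n$, $\varkappa \geq k$, and may be re-written as

$$\Pr(\Xi = \varkappa \mid K_n = k) = \frac{(n - 1)!}{(k - 1)! \, (\varkappa + n - 1)!} (\varkappa - 1)_{k-1 \downarrow} (k - \gamma)_{\varkappa - k \uparrow} (\gamma + n - k)_{k \uparrow}.$$

By the substitution $\varkappa = k + k^*$ the result easily follows

$$\Pr(\Xi = k + k^* \mid K_n = k) = \frac{(n - 1)!}{(k - 1)!} (\gamma + n - k)_{k \uparrow} \frac{\Gamma(k^* + k)}{\Gamma(k + k^* + n)} \frac{(k - \gamma)_{k^* \uparrow}}{k^*!}.$$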

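The corresponding expected value of the number of new boxes is

$$E(\Xi - k \mid K_n = k) = \frac{(n - 1)!}{(k - 1)!} (\gamma + n - k)_{k \uparrow} \sum_{k^*=1}^{\infty} \frac{\Gamma(k^* + k)}{\Gamma(k + k^* + n)} \frac{(k - \gamma)_{k^* \uparrow}}{(k^* - 1)!},$$

which expressed in terms of $\varkappa = k + k^*$ yields, writing $x! = \Gamma(x + 1)$ for non-integer arguments,

$$E(\Xi - k \mid K_n = k) = \frac{(n - 1)! \, (n + \gamma - 1)_{k \downarrow}}{(k - 1)! \, (k - \gamma - 1)!} \sum_{\varkappa = k+1}^{\infty} \frac{(\varkappa - \gamma - 1)!}{(\varkappa - k - 1)! \, (\varkappa)_{n \uparrow}}.$$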
The distribution of the number $S$ of balls in the new $m$-sample which belong to new boxes,
given the number of boxes in the basic sample $K_n$ and the number of new boxes $K_m$, follows
by an application of eq. (12) in Lijoi et al. (2008)

$$\Pr(S = s \mid K_n = k, K_m = k^*) = \binom{s - 1}{k^* - 1} \binom{n + k + m - s - 1}{m - s} \bigg/ \binom{n + k + m - 1}{m - k^*}. \qquad (14)$$

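By Proposition 2 in Lijoi et al. (2008) the expected number of balls of the subsequent $m$-sample
falling in new boxes is given by

$$E(S \mid K_n = k) = m \frac{V_{n+1,k+1}}{V_{n,k}} = m \frac{k(k - \gamma)}{n(n + \gamma)} \qquad (15)$$

and by Proposition 4 in Lijoi et al. (2008) (cfr. also Corollary 10 in Cerquetti, 2008) the
probability that the $m$ new balls don't occupy a subset of $(k - r)$ old boxes arises from (7) by
summing over the ways to choose $s$ balls from the $m$ of the new group, by summing over the
ways to partition $s$ balls in a subset of $k^*$ boxes, and over the ways to allocate $m - s$ balls in
at most $r$ old boxes and is equal to

$$\sum_{k^*=0}^{m} \frac{(k)_{k^* \uparrow} (k - \gamma)_{k^* \uparrow} (\gamma + n - k)_{m-k^* \uparrow}}{(n)_{m \uparrow} (\gamma + n)_{m \uparrow}} \binom{m}{k^*} \Big(r + \sum_{j} n_j + k^*\Big)_{m-k^* \uparrow}, \qquad (16)$$

where $\sum_j n_j$ denotes the number of balls of the basic sample contained in the $r$ old boxes that may still be occupied.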
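The mean number of new-box balls, $m\,k(k-\gamma)/(n(n+\gamma))$, can be cross-checked against the joint distribution of $(K_m, S)$ given $K_n = k$. The sketch below assumes the joint law has the form $\frac{(k)_{k^*\uparrow}(k-\gamma)_{k^*\uparrow}(\gamma+n-k)_{m-k^*\uparrow}}{(n)_{m\uparrow}(\gamma+n)_{m\uparrow}} \binom{m}{s} (n+k)_{m-s\uparrow} L(s,k^*)$, with $L(s,k^*) = \binom{s-1}{k^*-1} s!/k^*!$ the Lah numbers and $L(0,0) = 1$; the function names are illustrative assumptions.

```python
from fractions import Fraction
from math import comb, factorial


def rising(x, n):
    """Rising factorial (x)_n = x (x+1) ... (x+n-1), with (x)_0 = 1."""
    out = Fraction(1)
    for i in range(n):
        out *= x + i
    return out


def lah(s, j):
    """Unsigned Lah number C(s-1, j-1) s!/j!, with the convention L(0, 0) = 1."""
    if s == 0 and j == 0:
        return 1
    if j < 1 or j > s:
        return 0
    return comb(s - 1, j - 1) * factorial(s) // factorial(j)


def joint_new_boxes_and_balls(k_star, s, m, n, k, gamma):
    """Pr(K_m = k_star, S = s | K_n = k) under the assumed closed form (see lead-in)."""
    weight_ratio = (rising(Fraction(k), k_star) * rising(k - gamma, k_star)
                    * rising(gamma + n - k, m - k_star)
                    / (rising(Fraction(n), m) * rising(gamma + n, m)))
    return weight_ratio * comb(m, s) * rising(Fraction(n + k), m - s) * lah(s, k_star)


def expected_balls_in_new_boxes(m, n, k, gamma):
    """E(S | K_n = k), computed by brute-force summation over the joint law."""
    return sum(s * joint_new_boxes_and_balls(j, s, m, n, k, gamma)
               for s in range(1, m + 1) for j in range(1, s + 1))
```

Brute-force summation is only practical for small $m$, but it is enough to confirm that the closed-form mean and the joint law agree.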

The conditional Gibbs structure characterizing Gnedin's model as from Proposition 3 in
Lijoi et al. (2008), which may be obtained by the operation of deletion of the first $k$ classes
(Pitman, 2003) as clarified in Cerquetti (2008, Prop. 12), is as follows, with $s = \sum_{j=1}^{k^*} s_j$,

$$p(s_1, \dots, s_{k^*} \mid m, n, k) = \frac{\Gamma(k + k^*) \Gamma(k + k^* - \gamma) \Gamma(\gamma + n + m - k - k^*)}{\sum_{j=1}^{s} \binom{s-1}{j-1} \frac{s!}{j!} \Gamma(k + j) \Gamma(k + j - \gamma) \Gamma(\gamma + n + m - k - j)} \prod_{j=1}^{k^*} s_j!.$$

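Finally a Bayesian nonparametric estimator for the probability that the $(n + m + 1)$th ball falls
in a new box given $K_n = k$ is readily derived by eq. (6) in Lijoi et al. (2007)

$$\hat{D}_m^{n,k} = \sum_{k^*=0}^{m} \frac{(k)_{k^*+1 \uparrow} (k - \gamma)_{k^*+1 \uparrow} (\gamma + n - k)_{m-k^* \uparrow}}{(n)_{m+1 \uparrow} (\gamma + n)_{m+1 \uparrow}} \frac{m!}{k^*!} \binom{m + n + k - 1}{m - k^*}$$

$$= \sum_{k^*=0}^{m} \frac{(k)_{k^*+1 \uparrow} (k - \gamma)_{k^*+1 \uparrow} (\gamma + n - k)_{m-k^* \uparrow}}{(n)_{m+1 \uparrow} (\gamma + n)_{m+1 \uparrow}} \binom{m}{k^*} (n + k + k^*)_{m-k^* \uparrow}.$$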



4 Appendix 



For $n = 0, 1, 2, \dots$, and arbitrary real $x$ and $h$, let $(x)_{n \uparrow h}$ denote the $n$th factorial power of $x$
with increment $h$ (also called generalized rising factorial)

$$(x)_{n \uparrow h} := x(x + h) \cdots (x + (n - 1)h) = \prod_{i=0}^{n-1} (x + ih) = h^n (x/h)_{n \uparrow}, \qquad (17)$$

where $(x)_{n \uparrow}$ stands for $(x)_{n \uparrow 1}$, $(x)_{n \uparrow 0} = x^n$ and $(x)_{0 \uparrow h} = 1$, and for which the following
multiplicative law holds

$$(x)_{n+r \uparrow h} = (x)_{n \uparrow h} (x + nh)_{r \uparrow h}. \qquad (18)$$
From e.g. Normand (2004, cfr. eq. 2.41 and 2.45) a binomial formula also holds, namely

$$(x + y)_{n \uparrow h} = \sum_{k=0}^{n} \binom{n}{k} (x)_{k \uparrow h} (y)_{n-k \uparrow h}, \qquad (19)$$

as well as a generalized version of the multinomial theorem, i.e.

$$\Big(\sum_{j=1}^{p} x_j\Big)_{n \uparrow h} = \sum_{n_j \geq 0, \, \sum n_j = n} \frac{n!}{n_1! \cdots n_p!} \prod_{j=1}^{p} (x_j)_{n_j \uparrow h}. \qquad (20)$$
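The binomial formula for generalized rising factorials can be verified exactly on rational inputs. The following is a minimal sketch under the definition $(x)_{n\uparrow h} = \prod_{i=0}^{n-1}(x + ih)$; the helper names are assumptions introduced for illustration.

```python
from fractions import Fraction
from math import comb


def gen_rising(x, n, h):
    """Generalized rising factorial (x)_{n, h} = x (x+h) ... (x+(n-1)h), with (x)_{0, h} = 1."""
    out = Fraction(1)
    for i in range(n):
        out *= x + i * h
    return out


def binomial_rhs(x, y, n, h):
    """Right-hand side of the binomial formula: sum_k C(n,k) (x)_{k,h} (y)_{n-k,h}."""
    return sum(comb(n, k) * gen_rising(x, k, h) * gen_rising(y, n - k, h)
               for k in range(n + 1))
```

Comparing `binomial_rhs(x, y, n, h)` with `gen_rising(x + y, n, h)` for a few values of `h` (including negative ones, which give falling factorials) is a quick way to catch sign or index slips when manipulating these identities.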

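We recall the notion of generalized Stirling numbers (for a comprehensive treatment see
Hsu and Shiue, 1998; see also Pitman, 2006). For arbitrary distinct reals $\eta$ and $\beta$, these are
the connection coefficients $S_{n,k}^{\eta,\beta}$ defined by

$$(x)_{n \downarrow \eta} = \sum_{k=0}^{n} S_{n,k}^{\eta,\beta} (x)_{k \downarrow \beta},$$

where $(x)_{n \downarrow h}$ are generalized falling factorials and $(x)_{n \downarrow -h} = (x)_{n \uparrow h}$. Hence for $\eta = -1$,
$\beta = -\alpha$, and $\alpha \in (-\infty, 1)$, $S_{n,k}^{-1,-\alpha}$ is defined by

$$(x)_{n \uparrow 1} = \sum_{k=0}^{n} S_{n,k}^{-1,-\alpha} (x)_{k \uparrow \alpha}, \qquad (21)$$

or by specializing partial Bell polynomials as follows

$$B_{n,k}\big((1 - \alpha)_{\bullet - 1 \uparrow}\big) = \sum_{\{A_1, \dots, A_k\} \in \mathcal{P}_{[n]}^{k}} \prod_{i=1}^{k} (1 - \alpha)_{|A_i| - 1 \uparrow} = \frac{n!}{k!} \sum_{(n_1, \dots, n_k)} \prod_{i=1}^{k} \frac{(1 - \alpha)_{n_i - 1 \uparrow}}{n_i!} = S_{n,k}^{-1,-\alpha}. \qquad (22)$$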

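Referring to formulas in Lijoi et al. (2007, 2008) it is convenient to recall that their
treatment is in terms of generalized factorial coefficients, which are the connection coefficients
$C_{n,k}^{\alpha}$ defined by $(\alpha y)_{n \uparrow 1} = \sum_{k=0}^{n} C_{n,k}^{\alpha} (y)_{k \uparrow 1}$ (cfr. Charalambides, 2005). From (17) and (21),
if $x = y\alpha$ then

$$(y\alpha)_{n \uparrow 1} = \sum_{k=0}^{n} S_{n,k}^{-1,-\alpha} (y\alpha)_{k \uparrow \alpha} = \sum_{k=0}^{n} S_{n,k}^{-1,-\alpha} \alpha^k (y)_{k \uparrow 1},$$

hence $S_{n,k}^{-1,-\alpha} = \alpha^{-k} C_{n,k}^{\alpha}$.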

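It is also worth clarifying the relationship between non-central generalized Stirling
numbers of the first kind as defined in Hsu and Shiue (1998) and non-central generalized factorial
coefficients as in Charalambides (2005).

First recall that non-central generalized Stirling numbers of the first kind are connection
coefficients defined by

$$(x)_{n \uparrow 1} = \sum_{k=0}^{n} S_{n,k}^{-1,-\alpha,\gamma} (x - \gamma)_{k \uparrow \alpha}$$

or by the following convolution relation

$$S_{n,k}^{-1,-\alpha,\gamma} = \sum_{s=k}^{n} \binom{n}{s} S_{s,k}^{-1,-\alpha} S_{n-s,0}^{-1,-\alpha,\gamma}. \qquad (23)$$

Since as a convention we assume $S_{s,k}^{-1,-\alpha} = 0$ for $s < k$ and it is known that $S_{n-s,0}^{-1,-\alpha,\gamma} =
(\gamma)_{n-s \uparrow 1}$, then

$$S_{n,k}^{-1,-\alpha,\gamma} = \sum_{s=k}^{n} \binom{n}{s} S_{s,k}^{-1,-\alpha} (\gamma)_{n-s \uparrow 1}. \qquad (24)$$

Now, for parameter of non-centrality $-\gamma$, and $x = y\alpha$,

$$(y\alpha - \gamma)_{n \uparrow 1} = \sum_{k=0}^{n} S_{n,k}^{-1,-\alpha,-\gamma} (y\alpha)_{k \uparrow \alpha} = \sum_{k=0}^{n} \sum_{s=k}^{n} \binom{n}{s} S_{s,k}^{-1,-\alpha} (-\gamma)_{n-s \uparrow 1} (y\alpha)_{k \uparrow \alpha}.$$

Then, exploiting the relation between central generalized Stirling numbers and central
generalized factorial coefficients,

$$(y\alpha - \gamma)_{n \uparrow 1} = \sum_{k=0}^{n} \sum_{s=k}^{n} \binom{n}{s} \alpha^{-k} C_{s,k}^{\alpha} (-\gamma)_{n-s \uparrow 1} (y\alpha)_{k \uparrow \alpha},$$

and from the definition of non-central factorial coefficients as in Charalambides (2005) it follows that

$$(y\alpha - \gamma)_{n \uparrow 1} = \sum_{k=0}^{n} \alpha^{-k} C_{n,k}^{\alpha,\gamma} (y\alpha)_{k \uparrow \alpha} = \sum_{k=0}^{n} C_{n,k}^{\alpha,\gamma} (y)_{k \uparrow 1}.$$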